ABSTRACT

Meiotic recombination is an important biological process. As a main driving force of evolution, recombination provides natural new combinations of genetic variations. Rather than occurring randomly across a genome, meiotic recombination takes place in some genomic regions (the so-called 'hotspots') with higher frequencies, and in other regions (the so-called 'coldspots') with lower frequencies. Therefore, information about the hotspots and coldspots would provide useful insights for in-depth study of the mechanism of recombination and of the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the post-genomic age, it is highly desirable to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor called 'iRSpot-PseDNC' was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called 'pseudo dinucleotide composition' (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. It was observed by the rigorous jackknife test that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae, indicating that the new predictor is promising or at least may become a complementary tool to the existing methods in this area. Although the benchmark data set used to train and test the current method was from S. cerevisiae, the basic approaches can also be extended to deal with all the other genomes. In particular, it has not escaped our notice that the PseDNC approach can also be used to study many other DNA-related problems. As a user-friendly web-server, iRSpot-PseDNC is freely accessible at http://lin.uestc.edu.cn/server/iRSpot-PseDNC.

INTRODUCTION 

Genetic recombination describes the generation of new combinations of alleles that occurs at each generation in diploid organisms. It is an important biological process and results from a physical exchange of chromosomal material (1). As a main driving force of evolution, recombination provides new combinations of genetic variations and accelerates the evolution of sexually reproducing organisms. A schematic illustration of the meiotic recombination pathways is given in Figure 1.

As recombination is crucial to genome evolution, the identification and characterization of recombination spots are of substantial importance. In the past decades, several global mapping studies have been performed to map double-strand break sites on yeast chromosomes to determine the distribution pattern of recombination regions across the genome (3-5). They found that meiotic recombination events generally concentrate in regions of 1 to 2.5 kilobases and do not occur randomly across the genome. Regions that exhibit elevated rates of recombination relative to a neutral expectation are called recombination hotspots, whereas those with low rates of recombination are called recombination coldspots. Additionally, they found that recombination regions do not share a consensus sequence. With the rapidly increasing number of sequenced genomes, it is highly desirable to develop reliable automated methods for the timely identification of recombination spots.

Figure 1. A schematic drawing to show the meiotic recombination pathways in a DNA system. Recombination is initiated by a double-strand break (DSB) catalysed by the Spo11 protein (green ball), a relative of archaeal topoisomerase VI. After DSBs are formed, Spo11 is removed from the DNA molecule (blue helix) and the single-stranded 3' ends are formed. These tails undergo strand invasion of intact homologous duplexes (red helix), ultimately yielding mature recombinant products. The repair of a meiotic DSB can result in either reciprocal exchange of the chromosome arms flanking the break (a crossover) as shown in the left lower panel, or no exchange of flanking arms (a non-crossover or parental configuration) as shown in the right lower panel. Adapted from (2).

Although considerable progress has been made in this regard, the accuracy of computational prediction of recombination spots still needs further improvement. The existing computational algorithm for recombination spot prediction was based on the nucleotide sequence contents (6), in which little sequence-order effect was taken into account. To improve the prediction quality, it is necessary to take this kind of effect into account. However, the number of possible patterns for DNA sequences is extremely large, and their lengths vary widely, making it difficult to incorporate the sequence-order information into a statistical predictor. Facing such a difficulty, how can we take the sequence-order effect into account to improve the prediction quality? If it is not feasible to count all the sequence-order information, can we find an approximate way to take it into account partially? Similar problems were also encountered in computational proteomics. To cope with this kind of problem, the concept of pseudo amino acid composition (PseAAC) was proposed by Chou (7). Since then, the concept of PseAAC has penetrated into almost all the fields of computational proteomics, such as predicting protein submitochondrial localization (8), predicting protein structural class (9), predicting DNA-binding proteins (10), identifying bacterial virulent proteins (11), predicting metalloproteinase family (12), predicting protein folding rate (13), predicting GABA(A) receptor proteins (14), predicting protein supersecondary structure (15), predicting cyclin proteins (16), classifying amino acids (17), predicting enzyme family class (18), identifying risk type of human papillomaviruses (19), predicting allergenic proteins (20), identifying G protein-coupled receptors and their types (21) and discriminating outer membrane proteins (22), among many others [see a long list of references cited in a review (23)]. Because of its wide and increasing usage, in 2012, a powerful software package called PseAAC-Builder (http://www.pseb.sf.net) (24) was established for generating various special modes of PseAAC, in addition to the earlier web-server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC) (25) built in 2008.

Encouraged by the successes of introducing the PseAAC approach (7,26) into computational proteomics, the present study was initiated in an attempt to propose a novel feature vector, called 'pseudo dinucleotide composition' (PseDNC), to represent DNA sequence samples by incorporating more sequence-order effects so as to improve the quality of predicting the recombination spots.

As summarized in a review (23) and demonstrated by a series of recent publications [see, e.g. (27-29)], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark data set to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us elaborate how to deal with these procedures one by one.

MATERIALS AND METHODS 

Benchmark data set 

The benchmark data set $\mathbb{S}$ for the recombination hotspots and coldspots was taken from Liu et al. (6). It contains 490 recombination hotspots and 591 recombination coldspots, and can be formulated as

$$\mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \tag{1}$$

where $\cup$ represents the symbol for 'union' in set theory; the subset $\mathbb{S}^{+}$ contains the recombination hotspots only, whereas $\mathbb{S}^{-}$ contains the recombination coldspots only. For the convenience of readers, the 490 sequences in $\mathbb{S}^{+}$ and 591 sequences in $\mathbb{S}^{-}$ are given in the Supplementary Information S1.

PseDNC 

Suppose a DNA sequence $\mathbf{D}$ with $L$ nucleic acid residues, i.e.

$$\mathbf{D} = R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L \tag{2}$$

where $R_1$ represents the nucleic acid residue at sequence position 1, $R_2$ the nucleic acid residue at position 2 and so forth. If the feature vector of the DNA sequence is formulated by its nucleic acid composition (NAC), we have

$$\mathbf{D} = [f(A)\ \ f(C)\ \ f(G)\ \ f(T)]^{\mathbf{T}} \tag{3}$$

where $f(A)$, $f(C)$, $f(G)$ and $f(T)$ are the normalized occurrence frequencies of adenine (A), cytosine (C), guanine (G) and thymine (T), respectively, in the DNA sequence; the symbol $\mathbf{T}$ is the transpose operator. As we can see from Equation 3, all the sequence-order information is lost if NAC is used to represent a DNA sequence. If the dinucleotide composition (DNC) is used to represent the DNA sequence, instead of the four components shown in Equation 3, the corresponding feature vector will contain 4 x 4 = 16 components, as given below

$$\mathbf{D} = [f(AA)\ \ f(AC)\ \ f(AG)\ \ f(AT)\ \cdots\ f(TT)]^{\mathbf{T}} = [f_1\ \ f_2\ \ f_3\ \ f_4\ \cdots\ f_{16}]^{\mathbf{T}} \tag{4}$$

where $f_1 = f(AA)$ is the normalized occurrence frequency of AA in the DNA sequence; $f_2 = f(AC)$, that of AC; $f_3 = f(AG)$, that of AG and so forth. Although the most contiguous local sequence-order information is included in Equation 4, none of the global sequence-order information is reflected by the formulation. DNC is the simplest pseudo NAC, or PseNAC, according to terminology similar to that used in (7).
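The two compositions of Equations 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `nac` and `dnc` are hypothetical, and it assumes that dinucleotides are counted as overlapping pairs and normalized by the number of pairs, L - 1.

```python
from itertools import product

BASES = "ACGT"
# The 16 dinucleotides in the order AA, AC, AG, AT, CA, ..., TT (cf. Equation 4).
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]


def nac(seq):
    """Nucleic acid composition (Equation 3): frequencies of A, C, G and T."""
    return [seq.count(base) / len(seq) for base in BASES]


def dnc(seq):
    """Dinucleotide composition (Equation 4): frequencies of AA, AC, ..., TT.

    Assumption: overlapping pairs R1R2, R2R3, ... are counted and the
    counts are divided by the number of pairs, L - 1.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DINUCLEOTIDES]


if __name__ == "__main__":
    example = "AACG"  # toy sequence, for illustration only
    print(nac(example))
    print(dict(zip(DINUCLEOTIDES, dnc(example))))
```

For the toy sequence AACG, the three overlapping pairs are AA, AC and CG, so each of them receives a frequency of 1/3 and the remaining 13 components are zero.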

To incorporate the global sequence-order information into the feature vector for the DNA sequence, let us consider the following approach. As shown in Equation 2, the first dinucleotide in the DNA sequence is $R_1R_2$, the second dinucleotide is $R_2R_3$ and so forth; the last one is $R_{L-1}R_L$. Thus, by following procedures similar to those described in (7) to reflect the global sequence-order information of a protein with a set of sequence-order-correlated factors, for the DNA sequence of Equation 2, we also have the corresponding factors as defined below:

$$
\begin{cases}
\theta_1 = \dfrac{1}{L-2} \sum\limits_{i=1}^{L-2} \Theta(R_iR_{i+1},\, R_{i+1}R_{i+2}) \\[2ex]
\theta_2 = \dfrac{1}{L-3} \sum\limits_{i=1}^{L-3} \Theta(R_iR_{i+1},\, R_{i+2}R_{i+3}) \\[2ex]
\theta_3 = \dfrac{1}{L-4} \sum\limits_{i=1}^{L-4} \Theta(R_iR_{i+1},\, R_{i+3}R_{i+4}) \\[1ex]
\qquad\vdots \\[1ex]
\theta_\lambda = \dfrac{1}{L-1-\lambda} \sum\limits_{i=1}^{L-1-\lambda} \Theta(R_iR_{i+1},\, R_{i+\lambda}R_{i+\lambda+1})
\end{cases}
\qquad (\lambda < L) \tag{5}
$$

where $\theta_1$ is called the first-tier correlation factor that reflects the sequence-order correlation between all the most contiguous dinucleotides along a DNA sequence (Figure 2a); $\theta_2$, the second-tier correlation factor between all the second most contiguous dinucleotides (Figure 2b); $\theta_3$, the third-tier correlation factor between all the third most contiguous dinucleotides (Figure 2c) and so forth.

In Equation 5, the parameter $\lambda$ is an integer, representing the highest counted rank (or tier) of the correlation along a DNA sequence, and the correlation function is given by

$$\Theta(R_iR_{i+1},\, R_jR_{j+1}) = \frac{1}{\mu} \sum_{u=1}^{\mu} \left[ P_u(R_iR_{i+1}) - P_u(R_jR_{j+1}) \right]^2 \tag{6}$$

where $\mu$ is the number of local DNA structural properties considered, which is equal to 6 in the current study as will be explained later in the text; $P_u(R_iR_{i+1})$ is the numerical value of the $u$-th ($u = 1, 2, \cdots, \mu$) DNA local property for the dinucleotide $R_iR_{i+1}$ at position $i$ and $P_u(R_jR_{j+1})$ the corresponding value for the dinucleotide $R_jR_{j+1}$ at position $j$.
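The correlation function of Equation 6 and the tier factors of Equation 5 might be sketched as follows. This is only an illustration under stated assumptions: the names `correlation` and `theta` are hypothetical, and the property dictionary holds just the four dinucleotides needed for a toy sequence, with their six normalized values copied from Table 2.

```python
# Normalized values (twist, tilt, roll, shift, slide, rise) copied from
# Table 2 for the four dinucleotides that occur in the toy sequence.
PROPS = {
    "AA": [0.06, 0.50, 0.27, 1.59, 0.11, -0.11],
    "AC": [1.50, 0.50, 0.80, 0.13, 1.29, 1.04],
    "CG": [-1.66, -1.22, -0.44, -0.82, -0.29, -1.39],
    "GT": [1.50, 0.50, 0.80, 0.13, 1.29, 1.04],
}


def correlation(d1, d2, props=PROPS):
    """Equation 6: mean squared difference over the mu structural properties."""
    p1, p2 = props[d1], props[d2]
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)


def theta(seq, tier, props=PROPS):
    """Equation 5: the tier-th correlation factor of a DNA sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # R1R2, R2R3, ...
    n_terms = len(pairs) - tier  # equals L - 1 - tier
    if n_terms < 1:
        raise ValueError("tier is too large for this sequence")
    total = sum(correlation(pairs[i], pairs[i + tier], props)
                for i in range(n_terms))
    return total / n_terms


if __name__ == "__main__":
    toy = "AACGT"  # toy sequence, for illustration only
    for tier in (1, 2, 3):
        print(tier, round(theta(toy, tier), 4))
```

Because AC and GT carry identical values in Table 2, their correlation is zero, so a dinucleotide and its reverse complement are indistinguishable to these factors.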

DNA local structural properties 

Multiple lines of evidence have indicated that some local DNA structural properties, i.e. angular parameters (twist, tilt and roll) and translational parameters (shift, slide and rise), have important roles in biological processes, such as protein-DNA interactions, formation of chromosomes and higher-order organization of the genetic material (30-32). Accordingly, these six structural properties



e68 Nucleic Acids Research, 2013, Vol. 41, No. 6 






Figure 2. A schematic illustration to show the correlations of dinucleotides along a DNA sequence. (a) The first-tier correlation reflects the sequence-order mode between all the most contiguous dinucleotides. (b) The second-tier correlation reflects the sequence-order mode between all the second-most contiguous dinucleotides. (c) The third-tier correlation reflects the sequence-order mode between all the third-most contiguous dinucleotides.

might have an impact on DNA binding of regulatory proteins, either directly by hampering or favoring complex formation or indirectly through the modulation of the chromatin structures and hence the DNA accessibility (33). Listed in Table 1 are their original numerical values derived from (32) for twist $P_1(R_iR_{i+1})$, tilt $P_2(R_iR_{i+1})$, roll $P_3(R_iR_{i+1})$, shift $P_4(R_iR_{i+1})$, slide $P_5(R_iR_{i+1})$ and rise $P_6(R_iR_{i+1})$, respectively, where $R_iR_{i+1}$ represents the 16 possible dinucleotides AA, AC, AG, AT, ..., TT. It was these six DNA local physical structural properties that were used in the correlation function to derive the PseDNC for the current study. This is also why $\mu = 6$ in Equation 6 for the current case.

Before being substituted into Equation 6, the original values listed in Table 1 for $P_u(R_iR_{i+1})$ ($u = 1, 2, \cdots, 6$) were all subjected to a standard conversion (26), as described by the following equation

$$P_u(R_iR_{i+1}) = \frac{P_u(R_iR_{i+1}) - \langle P_u \rangle}{\mathrm{SD}(P_u)} \tag{7}$$

where the symbol $\langle\ \rangle$ means taking the average of the quantity therein over the 16 different dinucleotides (cf. Equation 4), and SD means the corresponding standard deviation. The converted values obtained by Equation 7 will have a zero mean value over the 16 different dinucleotides and will remain unchanged if they go through the same conversion procedure again. Listed in Table 2 are the values of $P_u(R_iR_{i+1})$ ($u = 1, 2, \cdots, 6$) obtained via the standard conversion of Equation 7 from those of Table 1.
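The two properties claimed for the standard conversion (zero mean, and no change on a second pass) can be checked numerically with a short sketch. The function name `standard_conversion` is hypothetical, the input numbers are arbitrary rather than taken from Table 1, and the use of the sample standard deviation is an assumption, since the text does not say which form of SD was used.

```python
from statistics import mean, stdev


def standard_conversion(values):
    """Equation 7: subtract the mean and divide by the standard deviation."""
    avg = mean(values)
    sd = stdev(values)  # sample SD; an assumption, not stated in the text
    return [(v - avg) / sd for v in values]


if __name__ == "__main__":
    raw = [1.0, 2.0, 4.0, 7.0]  # arbitrary numbers, not values from Table 1
    once = standard_conversion(raw)
    twice = standard_conversion(once)
    print([round(v, 4) for v in once])
    print(abs(mean(once)) < 1e-12)                               # zero mean
    print(all(abs(a - b) < 1e-12 for a, b in zip(once, twice)))  # unchanged
```

The second pass leaves the values unchanged because the converted values already have zero mean and unit standard deviation, whichever form of SD is chosen, provided the same form is used both times.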

Now we can see from Figure 2 that the sequence-order effect of a DNA sequence can be, to some extent, reflected through a set of sequence-correlation factors $\theta_1, \theta_2, \theta_3, \cdots, \theta_\lambda$, as defined by Equations 5 and 6. Similar to the procedure described in (7) for converting the amino acid composition to the PseAAC, let us augment the DNC of Equation 4 to the PseDNC as given below

$$\mathbf{D} = [d_1\ \ d_2\ \cdots\ d_{16}\ \ d_{16+1}\ \cdots\ d_{16+\lambda}]^{\mathbf{T}} \tag{8}$$

where

$$
d_k =
\begin{cases}
\dfrac{f_k}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j} & (1 \le k \le 16) \\[3ex]
\dfrac{w\,\theta_{k-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j} & (17 \le k \le 16+\lambda)
\end{cases}
\tag{9}
$$

where $f_k$ ($k = 1, 2, \cdots, 16$) are the same as those in Equation 4, $\theta_j$ ($j = 1, 2, \cdots, \lambda$) are given by Equation 5, $\lambda$ is the total number of counted ranks (or tiers) of the correlations along a DNA sequence and $w$ is the weight factor. The concrete values for $\lambda$ and $w$ will be discussed further. Thus, instead of a 16-D (dimensional) vector (cf. Equation 4), the DNA sequence is now formulated by a $(16+\lambda)$-D vector as shown in Equation 8. It is through the additional $\lambda$ correlation factors (Figure 2) that not only can considerable global sequence-order effects be incorporated, but DNA sequences with extreme differences in length can also be converted into a set of feature vectors with the same dimension. The latter is an important prerequisite for formulating the statistical samples because many powerful classification engines, such as Covariant Discriminant (34,35), Support Vector Machine (SVM) (36) and K-Nearest Neighbor (37-39) algorithms, require the input to be a set of digital vectors with a fixed number of components.
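The assembly step of Equations 8 and 9 can be illustrated on its own, taking the 16 dinucleotide frequencies and the tier factors as given. This is a sketch under assumptions: the function name `pse_dnc_vector` is hypothetical, and the input numbers are placeholders rather than values computed from a real sequence.

```python
def pse_dnc_vector(f, thetas, w):
    """Equations 8 and 9: merge 16 DNC frequencies with lambda tier factors."""
    if len(f) != 16:
        raise ValueError("expected 16 dinucleotide frequencies")
    denom = sum(f) + w * sum(thetas)         # shared denominator of Equation 9
    head = [fk / denom for fk in f]          # components 1 to 16
    tail = [w * t / denom for t in thetas]   # components 17 to 16 + lambda
    return head + tail


if __name__ == "__main__":
    f = [1 / 16] * 16           # placeholder frequencies, illustration only
    thetas = [0.9, 1.1, 1.0]    # placeholder tier factors (lambda = 3)
    vec = pse_dnc_vector(f, thetas, w=0.05)
    print(len(vec), round(sum(vec), 6))
```

Whatever the sequence length, the result has 16 + lambda components, and because both branches of Equation 9 share one denominator the components always sum to 1.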

SVM 

SVM is an effective method for supervised pattern recognition and has been widely used in the realm of bioinformatics [see, e.g. (14,40-45)]. The basic idea of SVM is to transform the data into a high dimensional feature space and then determine the optimal separating hyperplane. A brief introduction to the formulation of SVM was given in (46). In this study, the SVM implementation was based on the freely available package LIBSVM 2.84 written by Chang and Lin (47). Because of its effectiveness and speed in the training process, the radial basis kernel function was used to obtain the best classification hyperplane. The
Dinucleotide   P1 (twist)   P2 (tilt)   P3 (roll)   P4 (shift)   P5 (slide)   P6 (rise)
CA             0.016        0.025       0.017       1.07         1.78         6.38
CC             0.026        0.042       0.019       1.43         1.65         8.04
CG             0.014        0.026       0.016       1.08         2.00         6.23
CT             0.031        0.037       0.019       1.46         2.03         7.08
GA             0.025        0.038       0.020       1.32         1.93         8.56
GC             0.025        0.036       0.026       1.20         2.61         9.53
GG             0.026        0.042       0.019       1.43         1.65         8.04
GT             0.036        0.038       0.023       1.32         3.03         8.93
TA             0.017        0.018       0.016       0.72         1.20         6.23
TC             0.025        0.038       0.020       1.32         1.93         8.56
TG             0.016        0.025       0.017       1.07         1.78         6.38
TT             0.026        0.038       0.020       1.69         2.26         7.65

^a In this table, the following symbols were used to represent the six physical structures of dinucleotide (32): P1 for 'twist', P2 for 'tilt', P3 for 'roll', P4 for 'shift', P5 for 'slide' and P6 for 'rise'.



Table 2. The normalized values for the six DNA dinucleotide physical structures^a

Dinucleotide   P1      P2      P3      P4      P5      P6
AA              0.06    0.50    0.27    1.59    0.11   -0.11
AC              1.50    0.50    0.80    0.13    1.29    1.04
AG              0.78    0.36    0.09    0.68   -0.24   -0.62
AT              1.07    0.22    0.62   -1.02    2.51    1.17
CA             -1.38   -1.36   -0.27   -0.86   -0.62   -1.25
CC              0.06    1.08    0.09    0.56   -0.82    0.24
CG             -1.66   -1.22   -0.44   -0.82   -0.29   -1.39
CT              0.78    0.36    0.09    0.68   -0.24   -0.62
GA             -0.08    0.50    0.27    0.13   -0.39    0.71
GC             -0.08    0.22    1.33   -0.35    0.65    1.59
GG              0.06    1.08    0.09    0.56   -0.82    0.24
GT              1.50    0.50    0.80    0.13    1.29    1.04
TA             -1.23   -2.37   -0.44   -2.24   -1.51   -1.39
TC             -0.08    0.50    0.27    0.13   -0.39    0.71
TG             -1.38   -1.36   -0.27   -0.86   -0.62   -1.25
TT              0.06    0.50    0.27    1.59    0.11   -0.11

^a See footnote a of Table 1 for further explanation.



regularization parameter $C$ and the kernel width parameter $\gamma$ were determined via an optimization procedure using a grid search approach, and their actual values thus obtained for the current study were $C = 32$ and $\gamma = 0.5$.
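To make the role of the kernel width parameter concrete, the radial basis function kernel can be written out directly. The sketch below is a plain-Python illustration of the standard form $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ with the reported $\gamma = 0.5$ as the default; it is not a substitute for LIBSVM, and the function name `rbf_kernel` is hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    # Toy vectors for illustration; real inputs would be (16 + lambda)-D PseDNC vectors.
    print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points give 1.0
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # similarity decays with distance
```

A larger $\gamma$ makes the similarity fall off faster with distance, which is why $\gamma$ is tuned jointly with $C$ in the grid search.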

iRSpot-PseDNC and its parameters 

The predictor obtained via the aforementioned procedures is called iRSpot-PseDNC. The PseDNC as formulated in Equations 8 and 9 contains two uncertain parameters, $\lambda$ and $w$. The former represents the total number of correlation ranks counted (cf. Equation 5 and Figure 2), which is an integer and should be smaller than the length of any of the DNA sequences involved in this study, whereas the latter is the weight factor ranging from 0 to 1 (26). Generally speaking, the greater the value of $\lambda$, the more sequence-order effects will be incorporated. However, if the value of $\lambda$ is too large, it might cause the overfitting problem (48) or 'high dimension disaster' (49). Preliminary tests indicated that in using the iRSpot-PseDNC predictor on the benchmark data set $\mathbb{S}$ (Supplementary Information S1), a peak was observed for the overall accuracy $\Lambda$ (cf. Equation 11) or Acc (cf. Equation 12) when $\lambda = 3$ and $w = 0.05$. Accordingly, these two numerical values were used for the two uncertain parameters in iRSpot-PseDNC.

RESULTS AND DISCUSSIONS 

One of the important procedures in developing a useful statistical predictor (23) is to objectively evaluate its performance or anticipated success rate. Now let us address this problem.



Criteria for performance evaluation 

To provide a more intuitive and easier-to-understand method to measure the prediction quality, the criteria proposed in (50) were adopted here. According to those criteria, the rates of correct predictions for the recombination hotspots in data set $\mathbb{S}^{+}$ and the recombination coldspots in data set $\mathbb{S}^{-}$ are respectively defined by (cf. Equation 1)

$$
\begin{cases}
\Lambda^{+} = \dfrac{N^{+} - N^{+}_{-}}{N^{+}}, & \text{for the recombination hotspots} \\[2ex]
\Lambda^{-} = \dfrac{N^{-} - N^{-}_{+}}{N^{-}}, & \text{for the recombination coldspots}
\end{cases}
\tag{10}
$$

where $N^{+}$ is the total number of the recombination hotspots investigated, whereas $N^{+}_{-}$ is the number of the recombination hotspots incorrectly predicted as coldspots; $N^{-}$ is the total number of the recombination coldspots investigated, whereas $N^{-}_{+}$ is the number of the recombination coldspots incorrectly predicted as hotspots. The overall success prediction rate is given by (51)

$$\Lambda = \frac{\Lambda^{+} N^{+} + \Lambda^{-} N^{-}}{N^{+} + N^{-}} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \tag{11}$$

It is obvious from Equations 10 and 11 that, if and only if none of the recombination hotspots and the recombination coldspots are mispredicted, i.e. $N^{+}_{-} = N^{-}_{+} = 0$ and $\Lambda^{+} = \Lambda^{-} = 1$, we have the overall success rate $\Lambda = 1$. Otherwise, the overall success rate would be <1.

On the other hand, it is instructive to point out that the following set of equations is often used in the literature for examining the performance quality of a predictor

$$
\begin{cases}
\mathrm{Sn} = \dfrac{TP}{TP + FN} \\[2ex]
\mathrm{Sp} = \dfrac{TN}{TN + FP} \\[2ex]
\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN} \\[2ex]
\mathrm{MCC} = \dfrac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{cases}
\tag{12}
$$

where TP represents the true positive; TN, the true negative; FP, the false positive; FN, the false negative; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient.

The relations between the symbols in Equation 11 and those in Equation 12 are given by

$$
\begin{cases}
TP = N^{+} - N^{+}_{-} \\
TN = N^{-} - N^{-}_{+} \\
FP = N^{-}_{+} \\
FN = N^{+}_{-}
\end{cases}
\tag{13}
$$

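Substituting Equation 13 into Equation 12 and also noting Equation 11, we obtain

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2ex]
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2ex]
\mathrm{Acc} = \Lambda = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2ex]
\mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}
\end{cases}
\tag{14}
$$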

Obviously, when $N^{+}_{-} = 0$, meaning that none of the recombination hotspots was mispredicted as a coldspot, we have the sensitivity Sn = 1, whereas when $N^{+}_{-} = N^{+}$, meaning that all the recombination hotspots were mispredicted as coldspots, we have the sensitivity Sn = 0. Likewise, when $N^{-}_{+} = 0$, meaning that none of the recombination coldspots was mispredicted, we have the specificity Sp = 1, whereas when $N^{-}_{+} = N^{-}$, meaning that all the recombination coldspots were incorrectly predicted as recombination hotspots, we have the specificity Sp = 0. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the recombination hotspots in the data set $\mathbb{S}^{+}$ and none of the recombination coldspots in $\mathbb{S}^{-}$ was incorrectly predicted, we have the overall accuracy Acc = $\Lambda$ = 1, whereas when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the recombination hotspots in the data set $\mathbb{S}^{+}$ and all the recombination coldspots in $\mathbb{S}^{-}$ were mispredicted, we have the overall accuracy Acc = $\Lambda$ = 0. The MCC is usually used for measuring the quality of binary (two-class) classifications. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the recombination hotspots in the data set $\mathbb{S}^{+}$ and none of the recombination coldspots in $\mathbb{S}^{-}$ was mispredicted, we have MCC = 1; when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have MCC = 0, meaning no better than random prediction; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, we have MCC = -1, meaning total disagreement between prediction and observation. As we can see from the aforementioned discussion, it is much more intuitive and easier to understand when using Equation 14 to examine a predictor for its sensitivity, specificity, overall accuracy and Matthews correlation coefficient.
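The equivalence of the two formulations (Equation 12 in terms of TP, TN, FP and FN, and Equation 14 in terms of the misprediction counts) can be checked numerically. In the sketch below the function names are hypothetical, the class sizes 490 and 591 are those of the benchmark data set, and the misprediction counts 100 and 50 are invented purely for illustration; they are not results from the paper.

```python
import math


def metrics_eq12(tp, tn, fp, fn):
    """Sn, Sp, Acc and MCC in the conventional form of Equation 12."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc


def metrics_eq14(n_pos, n_neg, pos_missed, neg_missed):
    """Sn, Sp, Acc and MCC in the form of Equation 14.

    pos_missed: hotspots predicted as coldspots.
    neg_missed: coldspots predicted as hotspots.
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    mcc = (1 - (pos_missed / n_pos + neg_missed / n_neg)) / math.sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg))
    return sn, sp, acc, mcc


if __name__ == "__main__":
    n_pos, n_neg = 490, 591            # benchmark class sizes
    pos_missed, neg_missed = 100, 50   # hypothetical counts, illustration only
    # Equation 13 maps the counts onto TP, TN, FP and FN.
    tp, fn = n_pos - pos_missed, pos_missed
    tn, fp = n_neg - neg_missed, neg_missed
    print(metrics_eq12(tp, tn, fp, fn))
    print(metrics_eq14(n_pos, n_neg, pos_missed, neg_missed))
```

Both calls should print the same four numbers up to floating-point rounding, which is the content of the substitution that leads from Equation 12 to Equation 14.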



Cross-validation 

In the literature, the following three cross-validation methods are often used to evaluate the quality of a predictor: the independent data set test, the subsampling (K-fold cross-validation) test and the jackknife test. However, as elaborated by an analysis in (52) and demonstrated by Equations 28-32 of (23), among the three cross-validation methods, the jackknife test is deemed the least arbitrary and most objective because it can always yield a unique result for a given benchmark data set, and hence has been widely recognized and increasingly used by investigators to examine the quality of various predictors [see, e.g. (11,16,21,22,29,53-57)]. Accordingly, the jackknife test was also adopted in this study to examine the anticipated success rates of the current predictor. In the jackknife test, all the samples in the benchmark data set $\mathbb{S}$ are singled out one by one and tested by the predictor trained on the remaining samples. During the jackknifing process, both the training data set and the testing data set are actually open, and each sample is in turn moved between the two.
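The jackknife (leave-one-out) procedure described above can be sketched generically. The classifier used here is a deliberately simple stand-in (nearest class mean on one-dimensional toy values), not the SVM of this study; all function names and the toy data are hypothetical.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out test: each sample is predicted by a model trained on the rest."""
    correct = 0
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        model = train(rest_x, rest_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in classifier (NOT the SVM of the paper): nearest class mean in 1-D.
def train_nearest_mean(xs, ys):
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means


def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x - means[label]))


if __name__ == "__main__":
    xs = [5.0, 5.5, 6.0, 1.0, 1.5, 2.0]  # toy feature values
    ys = ["hot", "hot", "hot", "cold", "cold", "cold"]
    print(jackknife_accuracy(xs, ys, train_nearest_mean, predict_nearest_mean))
```

Since the split is fully determined by the data set (every sample is held out exactly once), repeated runs give the same result, which is the uniqueness property the text refers to.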

The results obtained with iRSpot-PseDNC on the benchmark data set $\mathbb{S}$ of Supplementary Information S1 by the jackknife test are given in Table 3, where, to facilitate comparison, the corresponding results by the IDQD predictor (6) on the same benchmark data set are also given. As indicated in Table 3, the results reported by Liu et al. (6) were derived by the 5-fold cross-validation test. As elucidated in (23), such a test does not yield a unique result, as demonstrated below. For the current case, the benchmark data set $\mathbb{S}$ consists of $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$, where $\mathbb{S}^{+}$ contains 490 recombination hotspots and $\mathbb{S}^{-}$ contains 591 recombination coldspots. Substituting these data into Equations 28 and 29 of (23) with M = 2 (number of groups for classification) and r = 5 (number of folds for cross-validation), we obtain

$$
\Pi = \frac{490!}{[490 - \mathrm{Int}(490/5)]!\ \mathrm{Int}(490/5)!} \times \frac{591!}{[591 - \mathrm{Int}(591/5)]!\ \mathrm{Int}(591/5)!}
= \frac{490!}{(490 - 98)!\ 98!} \times \frac{591!}{(591 - 118)!\ 118!}
\approx 1.17 \times 10^{232}
\tag{15}
$$

where the symbol Int is the integer-truncating operator, meaning to take the integer part of the number in the bracket right after it. The result of Equation 15 indicates that the number of possible combinations of taking one-fifth of the samples from each of the two subsets, $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$, for conducting the 5-fold cross-validation will be $>10^{232}$, which is an astronomical figure, too large to be practical. Therefore, in their study (6), Liu et al. randomly picked only one of the ~$1.17 \times 10^{232}$ possible combinations (cf. Equation 15) to perform the 5-fold cross-validation. To compare iRSpot-PseDNC and IDQD (6) with the same test method, we also randomly picked one of the possible combinations from the same benchmark data set to perform the 5-fold cross-validation test with iRSpot-PseDNC, and the corresponding results thus obtained are given in Table 3 as well.
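The order of magnitude quoted in Equation 15 can be reproduced with exact integer arithmetic. The sketch below uses Python's `math.comb` (available from Python 3.8); the function name `fold_selections` is hypothetical, and integer division stands in for the Int operator.

```python
import math


def fold_selections(class_sizes, folds):
    """Number of ways to draw one fold's worth of samples from every class."""
    total = 1
    for n in class_sizes:
        # n // folds plays the role of Int(n / folds) in Equation 15.
        total *= math.comb(n, n // folds)
    return total


if __name__ == "__main__":
    count = fold_selections([490, 591], 5)
    print(490 // 5, 591 // 5)   # fold sizes drawn from each subset
    print(f"{count:.2e}")       # order of magnitude of Equation 15
```

The product of the two binomial coefficients is an integer with 233 digits, consistent with the figure of roughly 10^232 given in the text.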

As we can see from the table, the overall accuracy (Acc) achieved by iRSpot-PseDNC using the 5-fold cross-validation test is remarkably higher than that of IDQD (6), and the overall accuracy achieved by iRSpot-PseDNC using the rigorous jackknife test is also higher than that of IDQD. Besides the overall accuracy, the MCC values achieved by the iRSpot-PseDNC predictor in both the 5-fold cross-validation and jackknife tests are also higher than those of the IDQD predictor.

To further demonstrate its performance, we used iRSpot-PseDNC to identify the 452 recombination hotspots experimentally annotated by Pan et al. (58) for S. cerevisiae chromosome IV. The results are given in the Supplementary Information S2, from which we can see that 347 outcomes by the predictor were consistent with the experimental observations. The overall success rate was 76.77%, indicating that the method proposed in this article is promising in identifying recombination hot/cold spots, or can at the very least play a complementary role to the existing method in this area.

Web-server guide 

For the convenience of the vast majority of experimental scientists, let us give a step-by-step guide on how to use the iRSpot-PseDNC web-server to get the desired results without the need to follow the complicated mathematical equations, which were presented just for the integrity of developing the predictor. The detailed steps are as follows.

Step 1 

Open the web server at http://lin.uestc.edu.cn/server/iRSpot-PseDNC and you will see the top page of iRSpot-PseDNC on your computer screen, as shown in Figure 3. Click on the Read Me button to see a brief introduction to the predictor and the caveats when using it.



Figure 3. A semi-screenshot to show the top page of the iRSpot-PseDNC web-server (page title: 'iRSpot-PseDNC: Predicting recombination spots with pseudo dinucleotide composition'). Its website address is http://lin.uestc.edu.cn/server/iRSpot-PseDNC.



Table 3. A comparison of iRSpot-PseDNC with the existing method

Predictor         Test method    Sn (%)   Sp (%)   Acc (%)   MCC
iRSpot-PseDNC^a   Jackknife      73.06    89.49    82.04     0.638
                  5-fold cross   81.63    88.14    85.19     0.692
IDQD^b            5-fold cross   79.40    81.00    80.30     0.603

^a The parameters used: $\lambda = 3$ and $w = 0.05$ for Equation 9; $C = 32$ and $\gamma = 0.5$ for the LIBSVM operation engine (47).
^b From Liu et al. (6).
Step 2

Either type or copy/paste the query DNA sequence into the input box at the center of Figure 3. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol ('>') in the first column, followed by lines of sequence data. The words right after the '>' symbol in the single initial line are optional and only used for the purpose of identification and description. No line should be longer than 120 characters, and lines usually do not exceed 80 characters. The sequence ends if another line starting with a '>' appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.
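For readers who prepare input files programmatically, the FASTA layout described above can be parsed with a few lines of Python. This is a minimal sketch that follows only the rules stated in this step; the function name `parse_fasta` and the two example records are hypothetical and are not taken from the web-server's Example window.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = ">query_1 first example\nACGTAC\nGGTA\n>query_2\nttgaca\n"
    for name, seq in parse_fasta(example):
        print(name, seq, len(seq))
```

Splitting on the '>' lines in this way mirrors the rule that a new header line marks the start of another sequence.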

Step 3 

Click on the Submit button to see the predicted result. For example, if you use the query DNA sequences in the Example window as the input, after clicking the Submit button, you will see the following shown on the screen of your computer: the outcome for the first query sample is 'recombination hotspot'; the outcome for the second query sample is 'recombination coldspot'. All these results are fully consistent with the experimental observations as summarized in the Supplementary Information S1. It takes a few seconds for the aforementioned computation before the predicted result appears on your computer screen; the more query sequences there are and the longer each sequence is, the more time is usually needed.

Step 4 

Click on the Citation button to find the relevant papers 
that document the detailed development and algorithm of 
iRSpot-PseDNC. 

Step 5 

Click on the Data button to download the benchmark 
data sets used to train and test the iRSpot-PseDNC 
predictor. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Data sets 1 and 2. 